Please NOTE that because my project works with PHI and identifiable data from the Penn Medicine EHR(with IRB approval), much of my code will use the option results='hide' in the R code chunks, and therefore results of the code will be hidden.

Overview

This project explores the geographical phenomena and processes of SARS-CoV-2(Covid-19) among patients of the University of Pennsylvania Health System(UPHS), which serves the Greater Philadelphia Metropolitan Area. We dive into a geospatial analysis of Covid-19 and explore how the pandemic has affected certain populations of the area, specifically looking at demographic, socioeconomic, clinical and environmental correlations with outpatient outcomes like testing(especially positive tests).

Contributors to the project include:

  • Blanca Himes, PhD
  • Danielle Mowery, PhD
  • Sherry Xie, VMD-PhD candidate
  • Alexandra Rizaldi, BS candidate
  • Emily Schriver, MS

Introduction

The Covid-19 pandemic has affected millions on a global scale, having a prolonged, devastating impact on public health and economies in practically every part of the world. It has been no different in Philadelphia, where the infection cases grew rapidly in waves in 2020, but also disproportionately affected communities of color, with predominantly African Americans communities experiencing the highest rates of disease severity and mortality. Examining Covid-19 in Philadelphia offers a unique opprotunity and perspective as one of the United States’ most racially and ethnically diverse metropolitan cities. One of the ways in which we can help improve our current understanding of the pandemic is by conducting a multidisciplinary study on how the disease affected the city over time. Mapping this space-time progression of the disease can highlight the level of spread and factors that facilitated the spread, as well as bring to light the behavioral, socioeconomic and environmental factors that correlated with outcomes of the disease. The hope is that this analysis can further be used in the future for determining risk factors, ameliorating resource allocation issues and contributing to improved disease surveillance.

Data

This is a restrospective study examining UPHS EHR patients with at least one primary diagnosis ICD-10 code of flu-like symptoms or pneumonia, or patients who had one procedure code for Covid-19 or influenza testing from January 2020 to January 2021. The cohort consists of 378,601 patients, but due to missing data points the cohort was reduced to 172,284 patients. The study also incorporates publicly available data such as US Census Bureau data, community health data from the Public Health Management Corporation, and geocode data from geocod.io

Methods

The primary method of analysis will be divided into two parts:

  1. Visually study the clusters of Covid-19 cases according to different demographic and socioeconomic variables.
  2. Perform logistic regression to determine the demographic, socioeconomic and clinical factors that are associated with adverse health outcomes of Covid-19.

Please note that due to a lot of the data being protected health information under HIPAA, some code chunks results will be hidden. Any code chunk where results were hidden will start with the comment “Results are hidden due to PHI.” followed by a reason why it was hidden

library(tidyverse)
library(tidyr)
library(MapGAM)
library(splancs)
library(raster)
library(RColorBrewer)
library(dplyr)
library(ggplot2)
library(sp)
library(rgdal)
library(sf)
library(tidycensus)
library(ggsn)
library(mapview)
library(leaflet)
library(GISTools)
library(tmap)
library(naniar)
library(rgeos)
library(spatstat)
library(yaml)
library(visdat)
source("helper_funcs.R")
# Using YAML for neatness, also to hide potentially revealing filenames
files <- read_yaml("./filenames.yml")
## Warning in readLines(file): incomplete final line found on './filenames.yml'
#basic patient data as well as clinical variables
patient_data <- readRDS(files$patient_data_rds)
#output of geocod.io (i.e. longitude, latitude coordinates)
geocodes <- read.csv(files$geocodes_csv, stringsAsFactors = F)
#covid test results according to patient
lab_results <- readRDS(files$lab_results_rds)
#input to geocod.io - used mainly for data cleaning
geocodio_input <- readRDS(files$geocodio_input_rds)
#Polygons for Southeastern Pennsylvania
sepa <- readRDS(files$sepa_polygons_rds)
#Coordinates for boundaries for Counties in Southeastern Pennsylvania
sepac <- readRDS(files$sepa_counties_coords_rds)
# Block level census data for Philadelphia from Open Data Philly
phl_blocks <- st_read(files$phl_blockgroups_dir, stringsAsFactors=F)
# Area Depravation Index data by block level in Southeastern PA
adi <- read.csv(files$adi_blockgroups_txt)
# Population/Demographic/Socioeconomic data and Polygons for Philadelphia
pop <- st_read(files$phl_demographic_dir)

To start our analysis we want to create a single dataframe that captures all the pertinent information in patID, lab_results and geocodes and patient_data.

class(geocodes)
## [1] "data.frame"
colnames(geocodes)
##  [1] "X"              "PAT_ID_DEID"    "street"         "city"           "state"          "zip"            "Latitude"       "Longitude"      "Accuracy.Score" "Accuracy.Type"  "Number"         "Street"         "Unit.Type"      "Unit.Number"    "City"           "State"          "County"         "Zip"            "Country"        "Source"
class(geocodio_input)
## [1] "data.frame"
colnames(geocodio_input)
## [1] "PAT_ID_DEID" "street"      "city"        "state"       "zip"
options(scipen=1000000)

summary(as.character(geocodes$PAT_ID_DEID)) #429435
##    Length     Class      Mode 
##    429435 character character
summary(as.character(geocodio_input$PAT_ID_DEID)) #429435
##    Length     Class      Mode 
##    429435 character character
#do these match?
all(geocodes$PAT_ID_DEID==geocodio_input$PAT_ID_DEID)
## [1] FALSE
#get a count for how many match
id_matches <- data.frame(a = geocodes$PAT_ID_DEID,
                         b = geocodio_input$PAT_ID_DEID) %>%
              transform(diff = ifelse(a==b, 0, 1))
sum(id_matches$diff) #429435
## [1] 429435

Seems like something happened in the process of getting geocodes and all patient identifiers were corrupted. We will use the original patient identifiers from the input and drop from the output assuming it’s one-to-one and order was preserved.

# bind the two into one
geo <- bind_cols(PAT_ID_DEID = geocodio_input$PAT_ID_DEID, geocodes[,-(1:2)])

colnames(patient_data)
##  [1] "PAT_ID_DEID"    "birth_date"     "sex"            "race"           "ethnic_group"   "marital_status" "hispanic_yn"    "race_ethnicity" "first_enc"      "age_firstenc"   "agecat"         "earliest_enc"   "latest_enc"     "years_followed" "ast"            "copd"           "OCS"            "SABA"           "SAMA"           "LABA"           "LAMA"           "ICS"            "asthma_basic"  
## [24] "copd_basic"     "asthma_persist" "copd_persist"   "diseasecat"     "persistdzcat"
#merge patient geocodes with patient data table
pat <- left_join(patient_data,
                  geo[,c("PAT_ID_DEID", "Longitude", "Latitude", "Accuracy.Score")])
## Joining, by = "PAT_ID_DEID"
dim(pat) # 382606, 31
## [1] 382606     31

Next we take a look at our lab results and try to map them to our patient data

colnames(lab_results)
## [1] "PAT_ID_DEID"     "ORDER_DATE"      "RESULT_CAT"      "INFORMED_CAT"    "FINAL_RESULT"    "ApprovedService" "preop"           "hospscreen"      "omscreen"
summary(lab_results$FINAL_RESULT)
##      Positive      Negative Indeterminate       Pending          NA's 
##         19482        229867           989           372         63825
ggplot(data=lab_results, aes(x=FINAL_RESULT)) + geom_bar()

# We eliminate "Indeterminate", "Pending" and "NA's" for the sake of simplicity.
pat_lab <- lab_results %>%
  dplyr::filter(FINAL_RESULT %in% c("Positive", "Negative")) %>%
  group_by(PAT_ID_DEID) %>%
  mutate(positive = as.factor(as.numeric(ifelse(FINAL_RESULT == "Positive", 1, NA))),#as.factor(as.numeric(sum(FINAL_RESULT == "Positive") > 0)),
         covidtest = as.factor(as.numeric(ifelse(FINAL_RESULT == "Positive", 1, 0))),
         preop = as.factor(as.numeric(ifelse(preop==1,1,0))),
         hospscreen = as.factor(as.numeric(ifelse(hospscreen==1,1,0))),
         omscreen = as.factor(as.numeric(ifelse(omscreen==1,1,0))),
         month_date = format(as.Date(ORDER_DATE, "%Y-%m"))
         )

# Merge patient data with lab results on PAT_ID_DEID
pat_and_labs <- left_join(pat, pat_lab)
## Joining, by = "PAT_ID_DEID"

Next, we look at our geographic data, more specifically the vector data of our geographic regions of interest, Souteastern Pennsylvania and Philadelphia.

plot(sepa)
## Warning in wkt(obj): CRS object has no comment

#get county lines into the SEPA plot
plot(sepac)

class(sepac) #not an sp object, just a df
## [1] "data.frame"
head(sepac)
##        long      lat order  hole piece   group county
## 1 -75.88208 40.14260     1 FALSE     1 Berks.1  Berks
## 2 -75.88608 40.14522     2 FALSE     1 Berks.1  Berks
## 3 -75.89376 40.15022     3 FALSE     1 Berks.1  Berks
## 4 -75.89825 40.15315     4 FALSE     1 Berks.1  Berks
## 5 -75.90184 40.15550     5 FALSE     1 Berks.1  Berks
## 6 -75.94180 40.18159     6 FALSE     1 Berks.1  Berks
# we have to convert the df into a spdf 
#from df to sf
county_sf <- st_as_sf(sepac, coords=c("long", "lat"), crs=4326)
plot(county_sf)

# Looks like they have a few counties that are not considered SEPA, as according to our first plot, so we filter
filtered_counties <- dplyr::filter(county_sf, county %in% c("Bucks", "Montgomery", "Delaware", "Chester", "Philadelphia"))

# From the plot above, we can see they are just a collection of points, not polygons so we need to make appropriate conversion
county_polys <- aggregate(x = filtered_counties$geometry,
                          by= list(filtered_counties$group),
                          FUN=create_polygon) %>% st_sf() #create_polygon from helper_funcs.R
plot(county_polys)

tmap_mode("view")
## tmap mode set to interactive viewing
tmap_options(check.and.fix = TRUE)
tm_shape(sepa) + tm_polygons()+
tm_shape(county_polys) + tm_polygons()
plot(phl_blocks)
## Warning: plotting the first 9 out of 15 attributes; use max.plot = 15 to plot all

#we just need the geometry for now
plot(st_geometry(phl_blocks))

head(adi)
##           gisjoin        fips adi_natrank adi_staternk
## 1 G01000100208022 10010208022          35            2
## 2 G01000100208023 10010208023          61            4
## 3 G01000100208024 10010208024          48            2
## 4 G01007300129114 10730129114          13            1
## 5 G01007300129121 10730129121          74            6
## 6 G01007300129122 10730129122          63            4
table(adi$adi_natrank)
## 
##     1    10   100    11    12    13    14    15    16    17    18    19     2    20    21    22    23    24    25    26    27    28    29     3    30    31    32    33    34    35    36    37    38    39     4    40    41    42    43    44    45    46    47    48    49     5    50    51    52    53    54    55    56    57    58    59     6    60    61    62    63    64    65    66    67    68 
##  2158  2157  2157  2158  2157  2157  2158  2157  2157  2157  2158  2157  2157  2157  2158  2157  2157  2157  2158  2157  2157  2158  2157  2157  2157  2157  2158  2157  2157  2158  2157  2157  2158  2157  2158  2157  2157  2158  2157  2157  2158  2157  2157  2157  2158  2157  2157  2157  2158  2157  2157  2157  2158  2157  2157  2158  2157  2157  2157  2157  2158  2157  2157  2158  2157  2157 
##    69     7    70    71    72    73    74    75    76    77    78    79     8    80    81    82    83    84    85    86    87    88    89     9    90    91    92    93    94    95    96    97    98    99    GQ GQ-PH    PH 
##  2158  2158  2157  2157  2157  2158  2157  2157  2158  2157  2157  2157  2157  2158  2157  2157  2158  2157  2157  2157  2158  2157  2157  2157  2158  2157  2157  2157  2158  2157  2157  2158  2157  2157  2660   595  1349
#since the fips column is the same as the GEOID column in our block level philadelphia geodata, let's just merge them
#GQ, PH and GQ-PH are indications that they are missing due to falling into some type of suppression criteria.
adi_cleaned <- adi %>%
       replace_with_na(replace = list(adi_natrank = c("GQ", "PH", "GQ-PH"),
                                      adi_staternk = c("GQ", "PH", "GQ-PH"))) %>%
       mutate(adi_natrank = as.numeric(adi_natrank),
              adi_staternk = as.numeric(adi_staternk),
              fips = as.character(fips)) %>%
       rename(GEOID10 = fips)
phl <- left_join(phl_blocks, adi_cleaned)
## Joining, by = "GEOID10"

We start by looking at where our patients are located across Southeastern PA

# Results are hidden due to PHI. Reason: geolocation of patients shown.

#Let's first get rid of missing x, y coords since they can't be plotted.
n_miss(pat_and_labs$Longitude)
pat_and_labs <- drop_na(pat_and_labs, Longitude)
# Results are hidden due to PHI. Reason: geolocation of patients shown.
plot(sepa, main=NA)
plot(county_polys, add=TRUE)
plot(st_as_sf(pat_and_labs, coords=c("Longitude","Latitude")), add=TRUE)
vis_miss(pat_and_labs, warn_large_data=F)

This tells us we really need to drop a few variables, namely ApprovedService, ORDER_DATE, diseasecat, persistdzcat, and INFORMED_CAT

colnames(pat_and_labs)
##  [1] "PAT_ID_DEID"     "birth_date"      "sex"             "race"            "ethnic_group"    "marital_status"  "hispanic_yn"     "race_ethnicity"  "first_enc"       "age_firstenc"    "agecat"          "earliest_enc"    "latest_enc"      "years_followed"  "ast"             "copd"            "OCS"             "SABA"            "SAMA"            "LABA"            "LAMA"            "ICS"            
## [23] "asthma_basic"    "copd_basic"      "asthma_persist"  "copd_persist"    "diseasecat"      "persistdzcat"    "Longitude"       "Latitude"        "Accuracy.Score"  "ORDER_DATE"      "RESULT_CAT"      "INFORMED_CAT"    "FINAL_RESULT"    "ApprovedService" "preop"           "hospscreen"      "omscreen"        "positive"        "covidtest"       "month_date"
n_miss(pat_and_labs$ApprovedService)
## [1] 233034
n_miss(pat_and_labs$diseasecat)
## [1] 399879
n_miss(pat_and_labs$persistdzcat)
## [1] 432035
n_miss(pat_and_labs$INFORMED_CAT)
## [1] 402431

Since there are patients whose locations are outside of the areas on interest so we need to filter out non-SEPA resident-patients.

pat_points <- data.frame(x=pat_and_labs$Longitude, y=pat_and_labs$Latitude)
sepa_boundaries <- data.frame(sepa@polygons[[1]]@Polygons[[1]]@coords)

pats_in_sepa <- which(inout(pat_points, sepa_boundaries))
sepa_pats <- pat_and_labs[pats_in_sepa,]

dim(sepa_pats) #337517
## [1] 337517     42

For organization we can create a new data frame with our patient relevant data while also dropping columns we don’t need.

pat_data <- data.frame(covid = sepa_pats$covidtest,
                       Xcoord = sepa_pats$Longitude,
                       Ycoord = sepa_pats$Latitude,
                       case = sepa_pats$covidtest,
                       marital_status = sepa_pats$marital_status,
                       race_eth = sepa_pats$race_ethnicity,
                       sex = sepa_pats$sex,
                       age = sepa_pats$agecat,
                       yf = sepa_pats$years_followed,
                       ast = sepa_pats$ast,
                       copd = sepa_pats$copd,
                       ocs = sepa_pats$OCS,
                       saba = sepa_pats$SABA,
                       sama = sepa_pats$SAMA,
                       laba = sepa_pats$LABA,
                       lama = sepa_pats$LAMA,
                       ics = sepa_pats$ICS,
                       preop = sepa_pats$preop,
                       hs = sepa_pats$hospscreen,
                       omscreen = sepa_pats$omscreen)

With our data cleaned up, let’s start by testing the idea of covid disproportinately affecting people of color based on covid positive tests.

na.omitted <- dplyr::filter(pat_data, !is.na(covid))
ggplot(data=na.omitted, aes(x=race_eth, fill=covid, na.rm=TRUE)) +
geom_bar(position="dodge")

It’s clearly visible for Non-Hispanic black that there is a higher positive rate compared to Non-hispanic white despite having less testing numbers overall. We can start plotting our patients on the map and see where the ones that test positive compare against demographic factors like race, and socioeconomic factors like median income and area depravation.

#for the sake of speeding up our plots we're going to reduce our patient number with those that have complete cases
sum(complete.cases(pat_data))
## [1] 172284
pat_data_complete <- pat_data[complete.cases(pat_data),]
projcrs <- "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
pat_geo <- st_as_sf(x = pat_data_complete,
                    coords = c("Xcoord", "Ycoord"),
                    crs = projcrs)
dim(pat_geo)
## [1] 172284     19

Racial breakdown of patients in SEPA

# Results are hidden due to PHI. Reason: geolocation of patients shown.

tm_shape(sepa) +
  tm_borders(col = "#97d7f0", lwd = 2, lty = "solid", alpha = NA) +
  tm_fill(col="#e7f6f7") +
tm_shape(pat_geo) + tm_symbols(col="race_eth", size=0.02, border.alpha=0, alpha=0.5)

Positive tests among patients in SEPA

# Results are hidden due to PHI. Reason: geolocation of patients shown.
tm_shape(sepa) + 
  tm_borders(col = "#97d7f0", lwd = 2, lty = "solid", alpha = NA) +
  tm_fill(col="#e7f6f7") +
tm_shape(pat_geo) +
  tm_symbols(col="case", palette=c("grey90", "blue"), size=0.04, border.alpha=0, alpha=0.6) +
tm_layout("Covid-19 Positive(1) and Negative(0) tests",
            legend.position=c("right", "bottom"),
            )
## legend.postion is used for plot mode. Use view.legend.position in tm_view to set the legend position in view mode.

The same plot except for Philadelphia.

# Results are hidden due to PHI. Reason: geolocation of patients shown.
tm_shape(phl) + 
  tm_borders(col = "#97d7f0", lwd = 2, lty = "solid", alpha = NA) +
  tm_fill(col="#e7f6f7") +
tm_shape(pat_geo) +
  tm_symbols(col="case", palette=c("grey90", "blue"), size=0.04, border.alpha=0, alpha=0.6) +
tm_layout("Covid-19 Positive(1) and Negative(0) tests",
            legend.position=c("right", "bottom"),
            )
## legend.postion is used for plot mode. Use view.legend.position in tm_view to set the legend position in view mode.

Instead of looking at dots don’t seem too informative, and not getting a sense of density of overlapping dots, we can perform density estimation and encode that into hexagonal grids that get darker in color based on higher counts within grid points.

# Results are hidden due to PHI. Reason: geolocation of patients shown.
points_hex <- hexbin_map(as(pat_geo, "Spatial"), bins=120)
## Warning in proj4string(spdf): CRS object has comment, which is lost in output; in tests, see
## https://cran.r-project.org/web/packages/sp/vignettes/CRS_warnings.html
tm_shape(sepa) + 
  tm_borders(col = "#97d7f0", lwd = 2, lty = "solid", alpha = NA) +
tm_shape(points_hex) +
  tm_fill(col='z', alpha=0.8) +
tm_layout("Covid-19 Positive tests",
            legend.position=c("right", "bottom"),
            )
## legend.postion is used for plot mode. Use view.legend.position in tm_view to set the legend position in view mode.

If we look at Philadelphia, it’s clearly visible that the higher densities of covid positivity are in West and South Philadelphia.

# Results are hidden due to PHI. Reason: geolocation of patients shown.
points_hex <- hexbin_map(as(pat_geo, "Spatial"), bins=120)
## Warning in proj4string(spdf): CRS object has comment, which is lost in output; in tests, see
## https://cran.r-project.org/web/packages/sp/vignettes/CRS_warnings.html
tm_shape(phl) + 
  tm_borders(col = "#97d7f0", lwd = 2, lty = "solid", alpha = NA) +
tm_shape(points_hex) +
  tm_fill(col='z', alpha=0.8) +
tm_layout("Covid-19 Positive tests",
            legend.position=c("right", "bottom"),
            )
## legend.postion is used for plot mode. Use view.legend.position in tm_view to set the legend position in view mode.

We can characterize those areas of Philadelphia with some publicly available data.

#Area Depravation Index
tm_shape(phl) + 
  tm_borders(col = "#97d7f0", lwd = 0.8, lty = "solid", alpha = NA) +
  tm_fill(col = "adi_natrank", style="jenks")

Median Household Income

colnames(pop)
##  [1] "STATEFP10"  "COUNTYFP10" "TRACTCE10"  "GEOID10"    "NAME10"     "NAMELSAD10" "MTFCC10"    "FUNCSTAT10" "ALAND10"    "AWATER10"   "INTPTLAT10" "INTPTLON10" "GISJOIN"    "Shape_area" "Shape_len"  "medHHinc"   "totalPop"   "geometry"
tm_shape(pop) + tm_polygons(col="medHHinc", palette="RdYlGn")
# Results are hidden due to PHI. Reason: geolocation of patients shown.
pat_geo.white <- dplyr::filter(pat_geo, race_eth=="Non-Hispanic white")
pat_geo.black <- dplyr::filter(pat_geo, race_eth=="Non-Hispanic black")
pat_geo.hisp <- dplyr::filter(pat_geo, race_eth=="Hispanic")
pat_geo.other <- dplyr::filter(pat_geo, race_eth=="Other/unknown"|race_eth=="Asian/Pacific Islander")
points_hex.white <- hexbin_map(as(pat_geo.white, "Spatial"), bins=90)
## Warning in proj4string(spdf): CRS object has comment, which is lost in output; in tests, see
## https://cran.r-project.org/web/packages/sp/vignettes/CRS_warnings.html
points_hex.black <- hexbin_map(as(pat_geo.black, "Spatial"), bins=90)
## Warning in proj4string(spdf): CRS object has comment, which is lost in output; in tests, see
## https://cran.r-project.org/web/packages/sp/vignettes/CRS_warnings.html
points_hex.hisp <- hexbin_map(as(pat_geo.hisp, "Spatial"), bins=90)
## Warning in proj4string(spdf): CRS object has comment, which is lost in output; in tests, see
## https://cran.r-project.org/web/packages/sp/vignettes/CRS_warnings.html
points_hex.other <- hexbin_map(as(pat_geo.other, "Spatial"), bins=90)
## Warning in proj4string(spdf): CRS object has comment, which is lost in output; in tests, see
## https://cran.r-project.org/web/packages/sp/vignettes/CRS_warnings.html
a <- tm_shape(phl) + 
  tm_borders(col = "#97d7f0", lwd = 2, lty = "solid", alpha = NA) +
tm_shape(points_hex.white) +
  tm_fill(col='z', alpha=0.8, palette="OrRd") +
tm_layout("Non-Hispanic White")
b <- tm_shape(phl) + 
  tm_borders(col = "#97d7f0", lwd = 2, lty = "solid", alpha = NA) +
tm_shape(points_hex.black) +
  tm_fill(col='z', alpha=0.8, palette="Greens") +
tm_layout("Non-Hispanic Black")
c <- tm_shape(phl) + 
  tm_borders(col = "#97d7f0", lwd = 2, lty = "solid", alpha = NA) +
tm_shape(points_hex.hisp) +
  tm_fill(col='z', alpha=0.8, palette="Blues") +
tm_layout("Hispanic")
d <- tm_shape(phl) +
  tm_borders(col = "#97d7f0", lwd = 2, lty = "solid", alpha = NA) +
tm_shape(points_hex.other) +
  tm_fill(col='z', alpha=0.8, palette="Purples") +
tm_layout("Other")

m <- tmap_arrange(a, b, c, d)
m

I think we have enough surface evidence to say that there does seem to be some kind of correlation between race and socioeconomic factors and higher covid positive test rates. To put this theory to test, let’s first try a Generalized Additive Model (GAM) with a Two-Dimensional Smooth to see if X,Y coordinates is sufficient to be a predictor of covid positive outcome.

#trims the grid of X and Y coordinates to the boundaries of our SEPA map
sampled_data <- sample_n(pat_data_complete, 50000)
gamgrid <- predgrid(sampled_data,
                    map=sepa)
#By default the 1st variable will be outcome, the 2nd and 3rd variable are the geolocation coordinates, as in pat_data_complete
fit1 <- modgam(data=sampled_data, rgrid=gamgrid, m="crude", sp=0.5)
## The crude model is: 
## covid ~ lo(Xcoord, Ycoord, span = 0.5, degree = 1)
## Family: binomial Link: logit
## Warning in lo.wam(x, z, wz, fit$smooth, which, fit$smooth.frame, bf.maxit, : liv too small. (Discovered by lowesd)
## Warning in lo.wam(x, z, wz, fit$smooth, which, fit$smooth.frame, bf.maxit, : lv too small. (Discovered by lowesd)
## Warning in lo.wam(x, z, wz, fit$smooth, which, fit$smooth.frame, bf.maxit, : liv too small. (Discovered by lowesd)
## Warning in lo.wam(x, z, wz, fit$smooth, which, fit$smooth.frame, bf.maxit, : lv too small. (Discovered by lowesd)
## Warning in lo.wam(x, z, wz, fit$smooth, which, fit$smooth.frame, bf.maxit, : liv too small. (Discovered by lowesd)
## Warning in lo.wam(x, z, wz, fit$smooth, which, fit$smooth.frame, bf.maxit, : lv too small. (Discovered by lowesd)
## Warning in lo.wam(x, z, wz, fit$smooth, which, fit$smooth.frame, bf.maxit, : liv too small. (Discovered by lowesd)
## Warning in lo.wam(x, z, wz, fit$smooth, which, fit$smooth.frame, bf.maxit, : lv too small. (Discovered by lowesd)
## Warning in lo.wam(x, z, wz, fit$smooth, which, fit$smooth.frame, bf.maxit, : liv too small. (Discovered by lowesd)
## Warning in lo.wam(x, z, wz, fit$smooth, which, fit$smooth.frame, bf.maxit, : lv too small. (Discovered by lowesd)
## Warning in lo.wam(x, z, wz, fit$smooth, which, fit$smooth.frame, bf.maxit, : liv too small. (Discovered by lowesd)
## Warning in lo.wam(x, z, wz, fit$smooth, which, fit$smooth.frame, bf.maxit, : lv too small. (Discovered by lowesd)
## Warning in lo.wam(x, z, wz, fit$smooth, which, fit$smooth.frame, bf.maxit, : liv too small. (Discovered by lowesd)
## Warning in lo.wam(x, z, wz, fit$smooth, which, fit$smooth.frame, bf.maxit, : lv too small. (Discovered by lowesd)
## GAM predictions will use median values at all grid points 
##        for variables not included in newdata
summary(fit1$exp.fi)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.08203 0.57040 1.00000 1.02462 1.30243 3.31798
summary(fit1$fit)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.5007 -0.5614  0.0000 -0.1929  0.2642  1.1994
# There is a weird bug where if you dont have "+units=" in the proj4string,
# it automatically throws an error, when usually one is inferred.
# Tried multiples times to fix this but never managed to get it fixed.
# so I ended up just removing the proj4string altogether.
crs(sepa) <- NA
plot(fit1, sepa)

Now let’s try to fit a logistic regression with the rest of the variables we have, excluding the coordinates.

#drop coordinates
pat_data.glm <- subset(sampled_data, select = -c(case, Xcoord, Ycoord))
pat.glm <- glm(covid ~ .,
               data = pat_data.glm,
               family="binomial")
summary(pat.glm)
## 
## Call:
## glm(formula = covid ~ ., family = "binomial", data = pat_data.glm)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0145  -0.4748  -0.3539  -0.1699   3.4632  
## 
## Coefficients:
##                                Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)                    -1.94773    0.11936 -16.318 < 0.0000000000000002 ***
## marital_statusDOMESTIC PARTNER -0.30704    0.32995  -0.931             0.352084    
## marital_statusMARRIED           0.25109    0.08390   2.993             0.002766 ** 
## marital_statusNULL             -9.95196   97.04236  -0.103             0.918318    
## marital_statusOTHER             0.48070    0.13738   3.499             0.000467 ***
## marital_statusPARTNER          -0.14929    0.23355  -0.639             0.522665    
## marital_statusSEPARATED         0.23002    0.14516   1.585             0.113066    
## marital_statusSINGLE            0.23111    0.08460   2.732             0.006298 ** 
## marital_statusWIDOWED           0.24728    0.11480   2.154             0.031238 *  
## race_ethHispanic                0.88207    0.09500   9.285 < 0.0000000000000002 ***
## race_ethNon-Hispanic black      0.70816    0.08364   8.467 < 0.0000000000000002 ***
## race_ethNon-Hispanic white     -0.47182    0.08416  -5.607    0.000000020642587 ***
## race_ethOther/unknown           0.05557    0.10386   0.535             0.592622    
## sexMALE                         0.06279    0.03338   1.881             0.059923 .  
## age35-54                        0.12610    0.04087   3.086             0.002032 ** 
## age55-74                       -0.06119    0.04777  -1.281             0.200208    
## age75+                         -0.31276    0.08162  -3.832             0.000127 ***
## yf                             -0.08160    0.01112  -7.341    0.000000000000212 ***
## ast                            -0.18016    0.06248  -2.883             0.003935 ** 
## copd                           -0.47861    0.11252  -4.254    0.000021019450157 ***
## ocs                            -0.36335    0.04305  -8.440 < 0.0000000000000002 ***
## saba                            0.08283    0.05306   1.561             0.118508    
## sama                           -0.07824    0.08070  -0.970             0.332285    
## laba                           -0.35439    0.11822  -2.998             0.002720 ** 
## lama                            0.20544    0.14135   1.453             0.146100    
## ics                             0.26454    0.09849   2.686             0.007230 ** 
## preop1                         -1.06714    0.11149  -9.571 < 0.0000000000000002 ***
## hs1                            -1.39955    0.06576 -21.282 < 0.0000000000000002 ***
## omscreen1                      -0.43848    0.09200  -4.766    0.000001879027997 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 31129  on 49999  degrees of freedom
## Residual deviance: 27340  on 49971  degrees of freedom
## AIC: 27398
## 
## Number of Fisher Scoring iterations: 10

Results

The study shows to a certain extent that there is a correlation between geospatial, demographic and socioeconomic features and Covid-19 positive tests. In the regression analysis, race, age and screening were shown to be significant variables. In the race variable, Nonhispanic white has a negative estimate while Nonhispanic black has a positive estimate which further supports the argument that covid has had a much greater negative impact on African Americans.

Looking Ahead

This study was very limited, looking only at covid positive testing as a health outcome, but could further be expanded to hospitalization and mortality rates to further show the effects of covid on disadvantaged sections of the population. Additionally, not much of the temporal aspect of the data was utilized and that could provide a much richer and deeper analysis of the determinants of Covid-19.